{
 "cells": [
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": [
    "# 基于前 30 个重要特征的模型训练\n",
    "\n",
    "在本 Notebook 中，我们将：\n",
    "\n",
    "1. 加载包含前 30 个重要特征的预处理数据\n",
    "2. 使用随机森林算法训练模型\n",
    "3. 将训练好的模型保存到指定目录\n",
    "\n",
    "这个步骤是在先前已提取的特征重要性基础上进行的，有助于我们简化模型，提高训练效率和预测精度。\n"
   ],
   "id": "3c27486fa0092bb3"
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "initial_id",
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# 导入必要的库\n",
    "import pandas as pd\n",
    "from sklearn.ensemble import RandomForestClassifier\n",
    "import joblib\n",
    "import os\n",
    "\n",
    "# 定义前 30 个重要特征的列表\n",
    "TOP_30_FEATURES = [\n",
    "    'Packet Length Variance', 'Packet Length Std', 'Max Packet Length', 'Subflow Bwd Bytes',\n",
    "    'Avg Bwd Segment Size', 'Average Packet Size', 'Packet Length Mean', 'Bwd Packet Length Max',\n",
    "    'Total Length of Bwd Packets', 'Bwd Packet Length Std', 'Init_Win_bytes_backward',\n",
    "    'Flow Bytes/s', 'Min Packet Length', 'Bwd Packet Length Mean', 'Flow IAT Max',\n",
    "    'Total Length of Fwd Packets', 'Fwd Packet Length Mean', 'Flow Duration', 'Flow IAT Min',\n",
    "    'Init_Win_bytes_forward', 'act_data_pkt_fwd', 'Total Fwd Packets', 'Fwd Header Length',\n",
    "    'Subflow Fwd Bytes', 'ACK Flag Count', 'Avg Fwd Segment Size', 'Flow IAT Std',\n",
    "    'Bwd IAT Mean', 'Fwd IAT Min', 'Idle Max'\n",
    "]\n"
   ]
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": [
    "### 单元格解释：\n",
    "在此单元格中，我们导入了需要的库，包括 `pandas`、`sklearn` 的 `RandomForestClassifier` 和 `joblib`，并定义了一个列表 `TOP_30_FEATURES`。这个列表包含我们之前提取的前 30 个重要特征，这些特征将被用于训练模型。\n"
   ],
   "id": "2d35cf042977c32c"
  },
  {
   "metadata": {},
   "cell_type": "code",
   "outputs": [],
   "execution_count": null,
   "source": [
    "# 定义加载预处理数据的函数\n",
    "def load_preprocessed_data(data_dir):\n",
    "    # 读取并筛选包含前 30 个特征的训练数据\n",
    "    X_train = pd.read_csv(os.path.join(data_dir, 'X_train.csv'))[TOP_30_FEATURES]\n",
    "    # 读取标签数据，并将其转换为一维数组格式\n",
    "    y_train = pd.read_csv(os.path.join(data_dir, 'y_train.csv')).values.ravel()\n",
    "    return X_train, y_train\n"
   ],
   "id": "38fcab973777a41c"
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": [
    "### 单元格解释：\n",
    "这个单元格定义了 `load_preprocessed_data` 函数，用于加载训练数据。我们从指定的 `data_dir` 路径中读取 `X_train.csv` 并选择前 30 个特征，同时读取 `y_train.csv` 中的标签数据。`y_train` 被转换为一维数组格式，确保其与模型输入的格式一致。\n"
   ],
   "id": "e7e23e018f8142fa"
  },
  {
   "metadata": {},
   "cell_type": "code",
   "outputs": [],
   "execution_count": null,
   "source": [
    "# 指定数据和模型保存路径\n",
    "data_dir = '../data/processed'\n",
    "model_dir = '../models'\n",
    "\n",
    "# 加载预处理后的数据\n",
    "X_train, y_train = load_preprocessed_data(data_dir)\n"
   ],
   "id": "27cec500cd24eec5"
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": [
    "### 单元格解释：\n",
    "在这里，我们指定了数据目录 `data_dir` 和模型保存目录 `model_dir`，然后调用 `load_preprocessed_data` 函数来加载 `X_train` 和 `y_train` 数据，这些数据仅包含前 30 个重要特征及其对应的标签。\n"
   ],
   "id": "e3ae0aa53f916ef1"
  },
  {
   "metadata": {},
   "cell_type": "code",
   "outputs": [],
   "execution_count": null,
   "source": [
    "# 训练随机森林模型并使用所有 CPU 核心\n",
    "rf_model = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)\n",
    "rf_model.fit(X_train, y_train)\n"
   ],
   "id": "b01a00adcae82654"
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": [
    "### 单元格解释：\n",
    "这一单元格定义并训练随机森林模型。`RandomForestClassifier` 设置了 100 颗决策树、随机种子 42，并使用 `n_jobs=-1` 以充分利用所有可用的 CPU 核心加速训练。训练完成后，模型被存储在 `rf_model` 变量中。\n"
   ],
   "id": "aab50fcbf48bc7ab"
  },
  {
   "metadata": {},
   "cell_type": "code",
   "outputs": [],
   "execution_count": null,
   "source": [
    "# 保存训练好的模型\n",
    "os.makedirs(model_dir, exist_ok=True)  # 创建模型保存目录（若不存在）\n",
    "joblib.dump(rf_model, os.path.join(model_dir, 'random_forest_model_top30.pkl'))\n",
    "print(\"模型训练并保存成功\")\n"
   ],
   "id": "942d2b40c3732fb9"
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": [
    "### 单元格解释：\n",
    "此单元格将训练好的随机森林模型保存到磁盘。我们使用 `joblib.dump` 将模型保存到指定的 `model_dir` 目录下，文件名为 `random_forest_model_top30.pkl`。`os.makedirs(model_dir, exist_ok=True)` 确保目录存在，若不存在则自动创建。最后，打印消息确认模型已成功训练和保存。\n"
   ],
   "id": "26531257effc9b75"
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
